Internet Documents: A Rich Source for Spoken Language Modeling

Authors

  • D. Vaufreydaz
  • M. Akbar
  • J. Rouillard
Abstract

Spoken language speech recognition systems need a better understanding of natural spoken language phenomena than their dictation counterparts. Current language models are mostly based on written text and/or very tedious Wizard of Oz or real dialog experiments. In this paper we propose to use Internet documents as a very rich source of information for spoken language modeling. Through detailed experiments we show how, using the Internet, we can automatically prepare language models adapted to a given task. For a given recognition system using this approach, the word accuracy is up to 15% better than that of a system using language models trained on written text.


Similar resources

Exploring the Use of Significant Words Language Modeling for Spoken Document Retrieval

Owing to the rapid global access to tremendous amounts of multimedia associated with speech information on the Internet, spoken document retrieval (SDR) has become an emerging application recently. Apart from much effort devoted to developing robust indexing and modeling techniques for spoken documents, a recent line of research aims at enriching and reformulating query representations in an...


Language Modeling for Spoken Dialogue System based on Filtering using Predicate-Argument Structures

We present a novel scheme of language modeling for a spoken dialogue system by effectively filtering query sentences collected via a Web site of wisdom of crowds. Our goal is a speech-based information navigation system by retrieving from backend documents such as Web news. Then, we expect that users make queries that are relevant to the backend documents. The relevance measure can be defined wi...


The TOBIAS test generator and its adaptation to some ASE challenges Position paper for the ASE Irvine Workshop

In the past decade, a scientific community has emerged around the notion of “Automated Software Engineering”. This community has made several advances in two kinds of challenges: the complexity of processing software engineering information, and the difficulty to capture knowledge about software. This position paper first recalls these challenges. It then describes how these challenges influenc...


Language Modeling Approach for Retrieving Passages in Lecture Audio Data

Spoken Document Retrieval (SDR) is a promising technology for enhancing the utility of spoken materials. After the spoken documents have been transcribed by using a Large Vocabulary Continuous Speech Recognition (LVCSR) decoder, a text-based ad hoc retrieval method can be applied directly to the transcribed documents. However, recognition errors will significantly degrade the retrieval performa...


Unsupervised spoken-term detection with spoken queries using segment-based dynamic time warping

Spoken term detection is important for retrieval of multimedia and spoken content over the Internet. Because it is difficult to have acoustic/language models well matched to the huge quantities of spoken documents produced under various conditions, unsupervised approaches using frame-based dynamic time warping (DTW) have been proposed to compare the spoken query with spoken documents frame by fr...




Publication date: 1999